ABSTRACT

This paper introduces LABurst, a general technique for identifying
key moments, or moments of high impact, in social media
streams without the need for domain-specific information or seed
keywords. We leverage machine learning to model temporal patterns
around bursts in Twitter’s unfiltered public sample stream and
build a classifier to identify tokens experiencing these bursts. We
show LABurst performs competitively with existing burst detection
techniques while simultaneously providing insight into and detection
of unanticipated moments. To demonstrate our approach’s potential,
we compare two baseline event-detection algorithms with
our language-agnostic algorithm to detect key moments across three
major sporting competitions: 2013 World Series, 2014 Super Bowl,
and 2014 World Cup. Our results show LABurst outperforms a
time series analysis baseline and is competitive with a domain-specific
baseline even though we operate without any domain knowledge.
We then go further by transferring LABurst’s models learned
in the sports domain to the task of identifying earthquakes in Japan
and show our method detects large spikes in earthquake-related
tokens within two minutes of the actual event.

1. INTRODUCTION 

Though researchers have presented many methods for adapting
social media streams into news sources for journalists or first
responders, many current approaches rely on prior knowledge and
manual keyword engineering to detect events of interest. While
straightforward and capable, such approaches are often constrained
to events one can easily anticipate or describe in very general terms,
potentially missing impactful but unexpected key moments. For
instance, one can follow the frequency of words like “goal” on Twitter
during the 2014 World Cup to detect when goals are scored [5],
but interesting occurrences like penalties or missed goals would
be missed. One might respond to this weakness by tracking additional
penalty-related tokens, but this approach is untenable in that
one cannot continually enlarge the keyword set. Furthermore, one
would still be unable to identify an unexpected moment like Luis
Suarez’s biting Giorgio Chiellini during the Uruguay-Italy World
Cup match; who would have thought to include “bite” as a relevant
token during that event? Relying on predefined keywords also
restricts these systems to those languages represented in the seed
keyword set, a significant issue for international events like the World
Cup.

Given the sheer volume of social media data (hundreds of thousands
of comments, statuses, and photos are generated per minute
on Facebook alone as of 2012 [21]), one could instead forgo seed
keywords completely and leverage time series analysis to track bursts
in message volume (as with Vasudevan et al. [25]). Such methods
gain flexibility of domain but sacrifice semantic information about
detected events (as one would need to extract keywords causing
such bursts manually). In this paper, we propose leveraging machine
learning to combine both techniques.

To explore this integration, we introduce LABurst (for language-agnostic
burst detection), a general method to model bursts in token
usage in social media streams. The volume of these bursts
then indicates the presence of a high-impact occurrence or key
moment. In short, the more tokens experiencing a simultaneous burst,
the higher the impact of that moment. Contrasting with existing
work, our approach is a streaming algorithm for unfiltered social
media streams that discovers high-impact moments without prior
knowledge of the target event and yields a description of the
discovered moment. Illustrating this flexibility is a collection of
experiments on Twitter’s sample stream surrounding key moments in
large sporting competitions and natural disasters. These experiments
compare LABurst to two existing burst detection methods: a
time series-based burst detection technique, and a domain-specific
technique with a pre-determined set of sports-related keywords.
Results from these experiments demonstrate LABurst’s competitiveness
with existing methods.

This work makes the following contributions: 

• Presents a streaming algorithm and feature set for the discovery
and description of impactful and unexpected key moments
in Twitter’s public sample stream without requiring
manually-defined keywords as input,

• Demonstrates our approach’s performance is both competitive
and flexible, and

• Transfers sports-trained models to disaster response with
comparable performance.

2. RELATED WORK 

Though LABurst focuses on the slightly different problem of
discovering interesting moments in social media streams, our work
shares foundations with classical event detection research.
Identifying key events from the ever-growing body of digital media has
fascinated researchers for over twenty years, starting from digital
newsprint to blogs and now social media [1]. Early event detection
research followed that of Fung et al. in 2005, who built on the burst
detection scheme presented by Kleinberg by identifying bursty
keywords from digital newspapers and clustering these keywords into
groups to identify bursty events [ ), 9]. This work succeeded in 
identifying trending events and showed such detection tasks are
feasible. Recognizing that newsprint differs substantially from social
media both in content and velocity, the research community
began experimenting with new social media sources like blogs,
but real gains came when microblogging platforms began their rise
in popularity. These microblogging platforms include Twitter and
Sina Weibo and are characterized by constrained post sizes (e.g.,
Twitter constrains user posts to 140 characters) and broadcasting
publicly consumable information.

One of the most well-known works in detecting events from
microblog streams is Sakaki, Okazaki, and Matsuo’s 2010 paper on
detecting earthquakes in Japan using Twitter [23]. Sakaki et al.
show that not only can one detect earthquakes on Twitter but also
that it can be done simply by tracking frequencies of earthquake-related
tokens. Surprisingly, this approach can outperform geological
earthquake detection tools since digital data propagates faster
than tremor waves in the Earth’s crust. Though this research is limited
in that it requires pre-specified tokens and is highly domain-
and location-specific (Japan has a high density of Twitter users,
so earthquake detection may perform less well in areas with fewer
Twitter users), it demonstrates a significant use case and the potential
of such applications.

Along with Sakaki et al., 2010 saw two other relevant papers:
Lin et al.’s construction of a probabilistic popular event tracker
[15] and Petrovic, Osborne, and Lavrenko’s application of
locality-sensitive hashing (LSH) for detecting first-story tweets from
Twitter streams [18]. Lin’s work demonstrated that the integration of
non-textual social and structural features into event detection could
produce real performance gains. Like many contemporary systems,
however, Lin’s models require seeding with pre-specified tokens to
guide their event detection and concentrate on retrospective per-day
topics and events. In contrast, Petrovic et al.’s clustering research in
Twitter avoids the need for seed keywords and retrospective analysis
by instead focusing on the practical considerations of clustering
large streams of data quickly. While typical clustering algorithms
require distance calculations for all pairwise messages, LSH
facilitates rapid clustering at the scale necessary to support event
detection in Twitter streams by restricting the number of tweets
compared to only those within some threshold of similarity. Once
these clusters are generated, Petrovic was able to track their growth
over time to determine impact for a given event. This research was
unique in that it was one of the early methods that did not require
seed tokens for detecting events and has been very influential,
resulting in a number of additional publications to demonstrate its
utility in breaking news and for high-impact crisis events [ \ 20, 
X. ]. Petrovic’s work and related semantic clustering approaches 
rely on textual similarity between tweets, which limits their ability
to operate in mixed-language environments and distinguishes them
from LABurst and its language agnosticism.

Similar to Petrovic, Weng and Lee’s 2011 paper on EDCoW,
short for Event Detection with Clustering of Wavelet-based Signals,
is also able to identify events from Twitter without seed keywords
[26]. After stringent filtering (removing stop words, common
words, and non-English tokens), EDCoW uses wavelet analysis
to isolate and identify bursts in token usage as a sliding window
advances along the social media stream. Besides the heavy filtering
of the input data, this approach exhibits notable similarities with the
language-agnostic method we describe herein with its reliance on
bursts to detect event-related tokens. These methods, however,
operate retrospectively, focusing on daily news rather than the breaking
event detection on which our research focuses. Becker, Naaman,
and Gravano’s 2011 paper on identifying events in Twitter also falls
under retrospective analysis, but their findings also demonstrate
reasonable performance in identifying events in Twitter by leveraging
classification tasks to separate tweets into those on “real-world
events” versus non-event messages [2]. Similarly, Diao et
al. also employ a retrospective technique to separate tweets into 
global, event-related topics and personal topics [7].


Many researchers have explored motivations for using platforms
like Twitter and have shown interesting dynamics in our behavior
around events with broad impact. For instance, Lehmann et al.’s
2012 work on collective attention on Twitter explores hashtags and
the different classes of activity around their use [14]. Their work
includes a class for activity surrounding unexpected, exogenous
events, characterized by a peak in hashtag usage with little activity
leading up to the event, which lends credence to our use of burst
detection for identifying such events. Additionally, this interest in
burst detection has led to several domain-specific research efforts
that also target sporting events specifically [25, 28, 12]. Lanagan
and Smeaton’s work is of particular interest because it relies almost
solely on detecting bursts in Twitter’s per-second message volume,
which we use as inspiration for one of our baseline methods
discussed below. Though naive, this frequency approach is able to
detect large bursts on Twitter in high-impact events without complex
linguistic analysis and performs well in streaming contexts as
little information must be kept in memory. Detecting such bursts
provides evidence of an event, but it is difficult to gain insight into
that event without additional processing. LABurst addresses this
need by identifying both the overall burst and keywords related to
that burst.

More recently, Xie et al.’s 2013 paper on TopicSketch seeks to
perform real-time event detection from Twitter streams “without
pre-defined topical keywords” by maintaining acceleration features
across three levels of granularity: individual token, bigram, and
total stream [27]. As with Petrovic’s use of LSH, Xie et al. leverage
“sketches” and dimensionality reduction to facilitate event detection
and also rely on language-specific similarities. Furthermore,
Xie et al. focus only on tweets from Singapore rather than
the worldwide stream. In contrast, our approach is differentiated
primarily in its language-agnosticism and its use of the unfiltered
stream from Twitter’s global network.

Despite this extensive body of research, it is worth asking how
event detection on Twitter streams differs from Twitter’s own offerings
on “Trending Topics,” which they make available to all their
users. When a user visits Twitter’s website, she is immediately
greeted with her personal feed as well as a listing of trending topics
for her city, country, worldwide, or nearly any location she chooses.
These topics offer insight into the current popular topics on Twitter,
but the main differentiating factor is that these popular topics
are not necessarily connected to specific events. Rather, popular
memetic content like “#MyLovelifeInMoveTitles” often appears on
the list of trending topics. Additionally, Twitter monetizes these
trending topics as a form of advertising [24]. These trending topics
also can be more high-level than the interesting moments we seek
to identify: for instance, during the World Cup, particular matches
or the tournament in general were identified as trending topics by
Twitter, but individual events like goals or penalty cards in those
matches were not. It should be clear then that Twitter’s trending
topics serve a different purpose than the streaming event detection
described herein.

3. MOMENT DISCOVERY DEFINED 

This paper demonstrates the LABurst algorithm’s ability to discover
and describe impactful moments from social media streams
without prior knowledge of the types or domains of these target
moments. To that end, we first lay LABurst’s foundations by defining
the problem LABurst seeks to solve and presenting the model
around which LABurst is built.

3.1 Problem Definition 

Given an unfiltered (though potentially down-sampled) stream
S of messages m consisting of various tokens w (where a “token”
is defined as a space-delimited string)¹, our objective is to
determine whether each time slice t contains an impactful moment
and, if so, extract tokens that describe the moment. Identifying
and describing such moments separately is difficult because, by the
time one can react to a key moment with a separate analysis tool,
the moment may have passed. We define a “key moment” here
as a brief instant in time, lasting on the order of seconds, that a
journalist would label as “breaking news.” Key moments might
comprise the highlights of a sporting competition or be the moment
an earthquake strikes, the moment a terrorist attack occurs,
or similar. Such moments often generate significant popular interest,
affect large populations, or represent an otherwise instrumental
moment in a larger event (e.g., the World Cup). By focusing on these
instantaneous moments of activity, we also avoid the complexities
of defining an “event” and the hierarchies among them.

Formally, we let $E$ denote the set of all time slices $t$ in which
a key moment occurs. The indicator function $\mathbb{1}_E(S_t, t)$ takes the
stream $S$ up to time $t$ and returns 1 for all times in which an
impactful moment occurs, and 0 for all other values of $t$. We then
define the moment discovery task as approximating this indicator
function $\mathbb{1}_E(S_t, t)$. We also include a function $B_E(S_t, t)$ that
returns a set of words $w$ that describe the discovered moment at time
$t$ if $t \in E$ and an empty set otherwise. To account for possible lag
in reporting the event, typing out a message about the event, and
the message actually posting to a social media server, we include a
delay parameter $\tau$. This parameter relaxes the task by constructing
the set $E'$ where, for all $t \in E$, $t, t+1, t+2, \ldots, t+\tau \in E'$. Since
our evaluation compares methods that share the same ground truth,
and controlling $\tau$ affects the ground truth consistently, comparative
results should be unaffected. In this paper, we use $\tau = 2$.

False positives/negatives and true positives/negatives follow in
the normal way for some candidate function $\hat{\mathbb{1}}_{E'}$: a false
positive is any time $t$ such that $\hat{\mathbb{1}}_{E'}(S_t, t) = 1$ and
$\mathbb{1}_{E'}(S_t, t) = 0$; likewise, a false negative is any $t$ such that
$\hat{\mathbb{1}}_{E'}(S_t, t) = 0$ and $\mathbb{1}_{E'}(S_t, t) = 1$. True
positives/negatives follow as expected.
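The relaxed ground truth and the resulting error counts can be illustrated with a minimal sketch. The function names, the integer indexing of time slices, and the name `tau` for the delay parameter are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def relax_ground_truth(key_slices, tau):
    """Build the relaxed set E' from E: each key slice t also marks t+1 .. t+tau."""
    relaxed = set()
    for t in key_slices:
        relaxed.update(range(t, t + tau + 1))
    return relaxed


def confusion_counts(predicted, key_slices, all_slices, tau=2):
    """Count TP/FP/FN/TN for a candidate detector against the relaxed set E'.

    predicted  -- set of slice indices the detector flagged
    key_slices -- set of slice indices in E (annotated key moments)
    all_slices -- iterable of every slice index under evaluation
    """
    relaxed = relax_ground_truth(key_slices, tau)
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t in all_slices:
        flagged = t in predicted
        actual = t in relaxed
        if flagged and actual:
            counts["tp"] += 1
        elif flagged:
            counts["fp"] += 1   # detector fired, no key moment in E'
        elif actual:
            counts["fn"] += 1   # key moment in E', detector silent
        else:
            counts["tn"] += 1
    return counts
```

Because every compared method is scored against the same relaxed set, changing `tau` shifts all methods' counts in the same way.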

3.2 The LABurst Model 

In LABurst, we sought to combine the language-agnostic flexibility
of burst detection techniques with the specificity of domain-specific
keyword burst detectors. This integration results from
ingesting a social media stream, maintaining a sliding window of
frequencies for each token contained within the stream, and using the
number of bursty tokens in a given minute as an indicator of the
moment’s impact. Critically, these tokens can be of any language
and are not stemmed, normalized, or otherwise modified. As an
example, after a goal is scored in a World Cup match, one would
expect to see many different forms of the word “goal” (both different
languages and different variations, such as “gooooaaal”) experiencing
a burst within a minute of the score. Most other approaches use
language models to collapse these various token forms, whereas
LABurst leverages this information as a predictor.

At a lower level, LABurst runs a sliding window over the incoming
data stream $S$ and divides it into slices of a fixed number
of seconds $\delta$ such that $t_i - t_{i-1} = \delta$. LABurst then combines
a set number $\omega$ of these slices into a single window (with an
overlap of $\omega - 1$ slices), splits each message in that window into
a set of tokens, and tabulates each token’s frequency. By maintaining
a list of frequency tables from the past $k$ windows up to
time $t$ (see Figure 1), we construct features describing a token’s
changes in frequency. From these features, we use machine learning
to separate tokens into two classes: bursty tokens $B_t$ and
non-bursty tokens $\bar{B}_t$. Following this classification, if the number of
bursty tokens exceeds some threshold $p$, i.e., $|B_t| > p$, LABurst flags
this window at time $t$ as containing a high-impact moment. In this
manner, LABurst approximates the target indicator function with
$\hat{\mathbb{1}}_{E'}(S_t, t) = [\,|B_t| > p\,]$ and yields $B_t$ as the set of descriptive
tokens for the given moment.

¹ Our use of “token” is more general than a “keyword” as it includes
numbers, emoticons, hashtags, or web links.
Figure 1: LABurst Sliding Window Model


To avoid spurious bursts generated by endogenous network phenomena,
retweets are discarded since existing literature shows retweets
propagate extremely rapidly, leading to possible false bursts [11].
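The sliding-window bookkeeping described in this section might be sketched as follows. This is an illustrative outline only: the class and function names, the `(text, is_retweet)` message representation, and the choice to rebuild the window table on every slice are assumptions, and the bursty-token classifier itself is left out.

```python
from collections import Counter, deque


def tokenize(text):
    """Split on whitespace only; tokens are not stemmed or normalized."""
    return text.split()


class SlidingWindowCounter:
    """Keeps the frequency tables of the most recent windows.

    slices_per_window -- number of slices combined into one window
    history           -- number of past window tables retained (k)
    """

    def __init__(self, slices_per_window, history):
        self.slices = deque(maxlen=slices_per_window)
        self.tables = deque(maxlen=history)

    def add_slice(self, messages):
        """Ingest one time slice of (text, is_retweet) pairs.

        Returns the token frequency table of the window ending at this slice.
        """
        counts = Counter()
        for text, is_retweet in messages:
            if is_retweet:
                continue  # retweets are discarded to avoid spurious bursts
            counts.update(tokenize(text))
        self.slices.append(counts)

        # Consecutive windows share all but one slice, so each new slice
        # produces a new window made of the most recent slices.
        window = Counter()
        for slice_counts in self.slices:
            window.update(slice_counts)
        self.tables.append(window)
        return window


def flag_moment(bursty_tokens, threshold):
    """Flag the window when the number of bursty tokens exceeds the threshold."""
    return len(bursty_tokens) > threshold
```

In a full pipeline, the retained tables would feed the per-token features, and whichever tokens the classifier marks as bursty would be passed to `flag_moment`.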

3.2.1 Temporal Features 

To capture token burst dynamics, we constructed a set of temporal
and graphical features to model these effects, shown in Table 1.
These features were calculated per token and normalized into the
range [0,1] to avoid scaling issues. Each feature’s relative importance
was then examined through an ablation study described later.
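As an illustration, two of the features listed in Table 1 (frequency regression and average frequency difference) could be computed for a single token as below. This is a sketch under stated assumptions: the add-one smoothing inside the logarithm is introduced here to avoid `log(0)` and is not described in the source, the window index is used as the regression's x-axis, and the normalization into [0,1] is omitted.

```python
import math


def regression_slope(freqs):
    """Least-squares slope of log-frequency against window index.

    freqs -- a token's frequency in each of the past k windows, oldest first.
    Uses log(f + 1), an assumption made here so zero counts are defined.
    """
    n = len(freqs)
    if n < 2:
        return 0.0
    ys = [math.log(f + 1) for f in freqs]
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def average_frequency_difference(freqs):
    """Most recent window's frequency minus the mean of the previous k - 1."""
    if len(freqs) < 2:
        return 0.0
    *previous, latest = freqs
    return latest - sum(previous) / len(previous)
```

A token whose count roughly doubles each window yields a slope near `log(2)`, while a steadily used token such as a stop word yields a slope near zero.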

3.2.2 LABurst’s Bursty Token Classification 

LABurst’s primary capability is its ability to differentiate between
bursty and non-bursty tokens. To make this determination,
LABurst integrates these temporal features into feature vectors for
each token and processes them using an ensemble of known classification
algorithms. Specifically, we use ensembles of support vector
machines (SVMs) [6] and random forests (RFs) [3] integrated
using AdaBoost [8].

Training these burst detection classifiers, however, requires both
positive and negative samples of bursty tokens. While obtaining
positive samples of bursty tokens is relatively straightforward,
negative samples are problematic. For positive samples, we can identify
high-impact, real-world events and construct a set of seed tokens
that should experience bursts along with the event (as done in
typical seed-based event detection approaches). Negative samples,
however, are difficult to identify since one cannot know all events
occurring around the world at a given moment. To address this
difficulty, we rely on a trick of linguistics and use stop words as
negative samples, our justification being that stop words are in general
highly used but used consistently (i.e., stop words are intrinsically
non-bursty). Therefore, in our experiments, we train LABurst on a
set of events with known bursty tokens and stop words in both English
and Spanish. As this task is semi-supervised, we also include
a self-training phase to expand our list of bursty tokens.
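The labeling scheme described above (seed tokens as positives, stop words as negatives) can be sketched in a few lines. The function name, the dictionary layout, and the decision to return the remaining tokens as an unlabeled pool for a later self-training step are assumptions for illustration; the source does not specify how the self-training phase is implemented.

```python
def label_samples(feature_vectors, seed_tokens, stop_words):
    """Split per-token feature vectors into labeled and unlabeled sets.

    feature_vectors -- dict mapping token -> list of feature values
    seed_tokens     -- tokens known to burst during annotated events (label 1)
    stop_words      -- tokens assumed intrinsically non-bursty (label 0)

    Returns (X, y, unlabeled), where unlabeled maps the remaining tokens to
    their features so a self-training step could later assign them labels.
    """
    X, y, unlabeled = [], [], {}
    for token, features in feature_vectors.items():
        if token in seed_tokens:
            X.append(features)
            y.append(1)
        elif token in stop_words:
            X.append(features)
            y.append(0)
        else:
            unlabeled[token] = features
    return X, y, unlabeled
```

The resulting `X` and `y` could then be passed to any classifier, and confident predictions on the unlabeled pool would be one way to expand the list of bursty tokens.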

4. EVALUATION FRAMEWORK 

Having established the details of our model, we now turn to
frameworks for evaluating LABurst compared to existing methods.

To explore such comparisons, we first look to similar methods for
detecting interesting events from social media streams and compare
their performance relative to LABurst. We then include a second
experiment to demonstrate LABurst’s domain independence
and utility in the disaster response context.

Table 1: Features

| Feature | Description |
| --- | --- |
| Frequency Regression | Given the logarithm of a token’s frequency at each window, take the slope of the line that best fits this data. This feature is also duplicated for message frequency and user frequency. |
| Average Frequency Difference | The difference between the token’s frequency in the most recent window and the average frequency across the previous k − 1 windows. As with the regression feature, this feature was also calculated for message frequency and user frequency. |
| Inter-Arrival Time | The average number of seconds between token occurrences in the previous k windows. |
| Entropy | The entropy of the set of messages containing a given token. |
| Density | The density of the @-mention network of users who use a given token. |
| TF-IDF | The term frequency, inverse document frequency for each token. |
| TF-PDF | A modified version of TF-IDF called term frequency, proportional document frequency [4]. |
| BursT | Weight using a combination of a given token’s actual frequency and expected token frequency [13]. |

4.1 Accuracy in Event Discovery 

Our first research question is RQ1: is LABurst able to identify
key moments as well as existing systems? To answer this question,
we constructed an experiment for enumerating key moments during
major sporting competitions. Such competitions are interesting
given their large followings (many fans to post on social media),
thorough coverage by sports journalists (high-quality ground truth),
and regular occurrence (large volume of data), making them ideal
for both data collection and evaluation. Such events are also complex
in that they include multiple types of events and unpredictable
patterns of events around scores, fouls, and other compelling moments
of play.

Our first step here was to collect data from a number of popular
sporting events and identify key moments in each competition.
We captured these moments and their times from sports journalism
articles, game highlights, box scores, blog posts, and social media
messages. These moments then comprise our ground truth.

We then introduced a pair of baseline methods: first, a time-series
algorithm using raw message frequency following the approaches
of Vasudevan et al. and the “activity peak detection”
method set forth by Lehmann et al. [25, 14], and second, a seed
keyword-based algorithm in the pattern of Cipriani and Zhao et al.
[5, 28]. We then evaluate the relative performance for LABurst and
both baselines as described below.


4.1.1 Sporting Competitions 

To minimize bias, these competitions covered several different
sporting types, from horse racing to the National Football League
(NFL), to Federation Internationale de Football Association (FIFA)
premier league soccer, to the National Hockey League (NHL), National
Basketball Association (NBA), and Major League Baseball (MLB).
Each competition also contained four basic types of events: beginning
of the competition, its end, scores, and penalties. Table 2 lists
the events we identified and the number of key moments in each.

Table 2: Sporting Competition Data

| Sport | Key Moments |
| --- | --- |
| Training Data | |
| 2010 NFL Division Championship | 13 |
| 2012 Premier League Soccer Games | 21 |
| 2014 NHL Stanley Cup Playoffs | 24 |
| 2014 NBA Playoffs | 3 |
| 2014 Kentucky Derby Horse Race | 3 |
| 2014 Belmont Stakes Horse Race | 3 |
| 2014 FIFA World Cup Stages A+B | 80 |
| Testing Data | |
| 2013 MLB World Series Game 5 | 7 |
| 2013 MLB World Series Game 6 | 8 |
| 2014 NFL Super Bowl | 13 |
| 2014 FIFA World Cup Third Place | 11 |
| 2014 FIFA World Cup Final | 7 |
| Total | 193 |


In 2012, we tracked four Premier League games in November.
For the 2013 World Series between the Boston Red Sox and the
St. Louis Cardinals, we covered the final two games on 28 October
and 30 October of 2013. Likewise, we tracked a subset of playoff
games during the 2014 NHL Stanley Cup and NBA playoffs.
For the 2014 World Cup, our analysis included a number of early
matches during stages 1 and 2 and the final two matches of the
tournament: the 12 July match between the Netherlands and Brazil for
third place, and the final match on 13 July between Germany and
Argentina for first place.

These events were split into training and testing sets; training 
data covered the 2010 NFL championship, 2012 premier league 
soccer games, NHL/NBA playoffs, the Kentucky Derby/Belmont 
Stakes horse races, and several days of World Cup matches in June 
of 2014. The testing data covered the 2013 MLB World Series, 
2014 NFL Super Bowl, and the final two matches of the 2014 FIFA 
World Cup. 


4.1.2 Burst Detection Baselines 


The LABurst algorithm straddles the line between time-series 
analysis and token-centric burst detectors. Therefore, to evaluate 
LABurst properly, we implemented two baselines for comparison. 
The first baseline, to which we refer as RawBurst, uses a known 
method for detecting bursts by taking the difference between the 
number of messages seen in the current time slice and the average 
number of messages seen over the past k time slices [25, 14]. 

Formally, we define a series of time slices $t \in T$ segmented into
$\delta$ seconds and a social media stream $S$ containing messages $m$ such
that $S_t$ contains all messages in the stream between $t - 1$ and $t$. We
then define the frequency of a given time slice $t$ as $\text{freq}(t, S) = |S_t|$
and the average over the past $k$ time slices as $\text{avg}(k, t, S)$, shown
in Eq. 1.

$$\text{avg}(k, t, S) = \frac{\sum_{j=t-k}^{t-1} \text{freq}(j, S)}{k} \qquad (1)$$

Given these functions, we take the difference $\Delta_{t,k}$ between the
frequency at time $t$ and the average over the past $k$ slices such that
$\Delta_{t,k} = \text{freq}(t, S) - \text{avg}(k, t, S)$. If this difference exceeds some
threshold $p$ such that $\Delta_{t,k} > p$, we say an event was detected at
time $t$.
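The RawBurst rule described above (flag a slice when its message count exceeds the mean of the previous k slices by more than the threshold) can be sketched as follows. The function name and the list-of-counts input are illustrative assumptions, as is the choice to skip the first k slices, which have no full history.

```python
def raw_burst(slice_counts, k, threshold):
    """Return the slice indices where a burst in message volume is detected.

    slice_counts -- number of messages observed in each time slice, in order
    k            -- number of preceding slices averaged as the baseline
    threshold    -- minimum excess over the baseline needed to flag a slice
    """
    detections = []
    for t in range(k, len(slice_counts)):
        baseline = sum(slice_counts[t - k:t]) / k  # mean of the previous k slices
        if slice_counts[t] - baseline > threshold:
            detections.append(t)
    return detections
```

Only the last k counts need to be held in memory at any time, which is why this kind of detector suits streaming settings.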

Following those like Cipriani from Twitter’s Developer Blog and
others, we then modify the RawBurst algorithm to detect events using
frequencies of a small set of seed tokens $w \in W$, to which we
will refer as TokenBurst [5]. To convert RawBurst into TokenBurst,
we modify the $\text{freq}(t, S)$ function to return the summed frequency
of all seed tokens, as shown in Eq. 2, where $\text{count}(w, S_t)$ returns
the frequency of token $w$ in the stream $S$ during time slice $t$. These
seed tokens are chosen such that they likely exhibit bursts in usage
during the key moments of our sporting event data, such as “goal”
for goals in soccer/football or hockey or “run” for runs scored in
baseball. This TokenBurst implementation also includes some rudimentary
normalization to collapse modified words to their originals
(e.g., “gooaallll” to “goal”). Many existing stream-based event detection
systems use just such an approach to track specific types of
events.

$$\text{freq}(t, S) = \sum_{w \in W} \text{count}(w, S_t) \qquad (2)$$

Since our analysis covers three separate types of sporting competitions,
seed keywords should include tokens from vocabularies
of each. We avoid separate keyword lists for each sport to provide
an even comparison to the general nature of our language-agnostic
technique. The tokens for which we searched are shown in Table
3. We also used regular expressions to collapse deliberately misspelled
tokens to their normal counterparts.

Table 3: Predefined Seed Tokens

| Sport | Tokens |
| --- | --- |
| World Series | run, home, homerun |
| Super Bowl | score, touchdown, td, fieldgoal, points |
| World Cup | goal, gol, golazo, score, foul, penalty, card, red, yellow, points |
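The seed-token frequency used by TokenBurst, together with a crude form of the regular-expression normalization mentioned above, might look like the following sketch. The specific rule (lowercase, then squeeze any run of a repeated character to a single character) is an assumption chosen for illustration; the source does not give the authors' actual expressions. Seed tokens are passed through the same rule so that words with legitimate double letters, such as "yellow", still match.

```python
import re

# Matches any character followed by one or more repeats of itself.
_REPEATS = re.compile(r"(.)\1+")


def collapse(token):
    """Lowercase a token and squeeze repeated characters ("gooaallll" -> "goal")."""
    return _REPEATS.sub(r"\1", token.lower())


def seed_frequency(messages, seed_tokens):
    """Summed frequency of all seed tokens across one time slice of messages.

    messages    -- iterable of message strings from a single time slice
    seed_tokens -- iterable of seed keywords
    """
    seeds = {collapse(seed) for seed in seed_tokens}
    return sum(
        1
        for message in messages
        for token in message.split()
        if collapse(token) in seeds
    )
```

Feeding the per-slice output of `seed_frequency` into the same difference-from-average rule as RawBurst gives a TokenBurst-style detector. Note that this squeeze rule is lossy (it also maps unrelated words with doubled letters onto shorter forms), so a production system would likely use something more careful.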


4.1.3 Performance Evaluation 

Having defined LABurst, RawBurst, and TokenBurst, we evaluate
these algorithms by constructing a series of receiver operating
characteristic (ROC) curves across test sets of our sports data. We
generate each curve by varying the method’s threshold parameter
and then evaluate relative performance between the approaches by
comparing their respective areas under the curves (AUCs). In RawBurst
and TokenBurst, this threshold parameter refers to $p$ in $\Delta_{t,k} > p$. For our
LABurst method, the ROC curve is generated by varying the minimum
$p$ in $\hat{\mathbb{1}}_{E'}(S_t, t) = [\,|B_t| > p\,]$. The AUC of the ROC curve
is useful because it is robust against imbalanced classes, which we
expect to see in such an event detection task. Then, by comparing
these AUC values, we can provide an answer to RQ1.
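A threshold sweep of this kind can be sketched in plain Python. Here each time slice carries a detector score (for example, the difference from the running average for the baselines, or the number of bursty tokens for LABurst) and a 0/1 ground-truth label. The function names are illustrative, the sweep uses each observed score as a cut-off with `>=`, and the sketch assumes both classes are present in `labels`.

```python
def roc_points(scores, labels):
    """Sweep the detection threshold and return (FPR, TPR) points.

    scores -- per-slice detector score
    labels -- 1 if the slice is in the (relaxed) ground truth, else 0
    Assumes at least one positive and one negative label.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

An AUC of 0.5 corresponds to a detector whose scores carry no information about the labels, which is the reference point when comparing methods.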

4.2 Evaluating Domain Independence 

Beyond LABurst’s ability to discover and describe interesting
moments, we also claim it to be domain independent. To justify
this claim, we must answer our second research question RQ2: can
LABurst transfer models learned in one context to another one separate
from its training domain and remain competitive?

Detecting key moments within sporting competitions as described
above is a useful task for areas like advertising or automated highlight
generation, but a more compelling and worthwhile task would
be to detect higher-impact events like natural disasters. The typical
seed-token-based approach is difficult here as it is impossible
to know what events are about to happen where, and a list of target
keywords to detect all such events would be long and lead to false
positives. LABurst could be highly beneficial here as one need not
know details like event location, language, or type. This context
presents an opportunity to evaluate LABurst in a new domain and
compare it to existing work by Sakaki, Okazaki, and Matsuo [23].
Thus, to answer RQ2, we can take the LABurst models as trained
on sporting events presented for RQ1 and apply them directly to
this context.

For this earthquake detection task, we compare LABurst with the
TokenBurst baseline using the keyword “earthquake,” as in Sakaki,
Okazaki, and Matsuo. Also following Sakaki et al., we target earthquakes
in Japan over the past two years and select two of the most
severe: the 7.1-magnitude quake off the coast of Honshu, Japan
on 25 October 2013, and a 6.5-magnitude quake off the coast of
Iwaki, Japan on 11 July 2014. Rather than generating ROC curves
for this comparison, we take a more straightforward approach and
compare lag between the actual earthquake event and the point in
time in which the two methods detect the earthquake. If the lag between
TokenBurst and LABurst is sufficiently small, we will have
good evidence for an affirmative answer to RQ2.

5. DATA COLLECTION 

While the algorithms described herein are general and can be applied
to any sufficiently active social media stream, the ease with
which one can access and collect Twitter data makes it an attractive
target for our research. To this end, we leveraged two existing
Twitter corpora and created our own corpus of tweets from Twitter’s
1% public sample stream². This new corpus was created using
the twitter-tools library³ developed for evaluations at the NIST Text
Retrieval Conferences (TRECs). In collecting from Twitter’s public
sample stream, we connect to the Twitter API endpoint (providing no
filters) and retrieve a sampling of 1% of all public tweets, which
yields approximately 4,000 tweets per minute.

The two existing corpora we used were the Edinburgh Corpus 
[19], which covered the 2010 NFL division championship game, 
and an existing set of tweets pulled from Twitter’s firehose source 
targeted at Argentina during November of 2012, which covered the 
four Premier League soccer games. All remaining data sets were 
extracted from Twitter’s sample stream over the course of October 
2013 to July 2014. 

Where possible, for each event (both sporting and earthquake), 
we recorded all tweets from the 1% stream starting an hour before 
the target event and ending an hour after the event, yielding over 15 
million tweets. Table 4 shows the breakdown of tweets collected 
per event. From these tweets, we extracted 1,109 positive (i.e., 
known bursty) samples and 43,037 negative samples for a total of 
44,146 data points. 

2 https://dev.twitter.com/streaming/reference/get/statuses/sample

3 https://github.com/lintool/twitter-tools

Table 4: Per-Event Tweet Counts

Event                                Tweet Count

Training Data
2010 NFL Division Championship           109,809
2012 Premier League Soccer Games       1,064,040
2014 NHL Stanley Cup Playoffs          2,421,065
2014 NBA Playoffs                        500,170
2014 Kentucky Derby Horse Race           233,172
2014 Belmont Stakes Horse Race           226,160
2014 FIFA World Cup Stages A+B         5,867,783

Testing Data
2013 MLB World Series Game 5           1,052,852
2013 MLB World Series Game 6           1,026,848
2013 Honshu Earthquake                   444,018
2014 NFL Super Bowl                    1,024,367
2014 FIFA World Cup Third Place          809,426
2014 FIFA World Cup Final              1,166,767
2014 Iwaki Earthquake                    358,966

Total                                 16,305,443

6. EXPERIMENTAL RESULTS
6.1 Setting Model Parameters

Prior to carrying out the experiments described above, we first
needed appropriate parameters for window sizes and LABurst’s
classifiers. For LABurst’s slice size $\delta$, window size $\omega$,
and $k$ previous window parameters, preliminary experimentation
yielded acceptable results with the following: $\delta = 60$ seconds,
$\omega = 180$ seconds, and $k = 10$. We used these $\delta$ and $k$
parameters in both RawBurst and TokenBurst as well.
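
To make the roles of the slice size (60 seconds), window size (180 seconds), and number of previous windows (10) more concrete, the following is a minimal sketch of one way such parameters could be applied to a stream of timestamped tokens. The function names and the assumption that windows slide forward one slice at a time are ours for illustration; the text above does not specify these details.

```python
from collections import Counter

DELTA = 60    # slice size in seconds (value reported above)
OMEGA = 180   # window size in seconds (value reported above)
K = 10        # number of previous windows retained (value reported above)


def slice_counts(events, delta=DELTA):
    """Bucket (timestamp_seconds, token) pairs into per-slice token counts."""
    slices = {}
    for ts, token in events:
        index = int(ts // delta)
        slices.setdefault(index, Counter())[token] += 1
    return slices


def window_series(slices, token, last_slice, delta=DELTA, omega=OMEGA, k=K):
    """Return the token's counts for k previous windows plus the current one.

    Assumption (not from the paper): windows slide one slice at a time,
    so each window sums omega // delta consecutive slices.
    """
    per_window = omega // delta
    series = []
    for end in range(last_slice - k, last_slice + 1):
        start = end - per_window + 1
        total = sum(slices.get(i, Counter())[token] for i in range(start, end + 1))
        series.append(total)
    return series


# Illustrative data only: five tokens observed over roughly three minutes.
events = [(0, "goal"), (30, "goal"), (70, "goal"), (130, "gol"), (200, "goal")]
slices = slice_counts(events)
print(window_series(slices, "goal", last_slice=3, k=2))  # -> [3, 3, 2]
```

A series like this, one per token, is the kind of input from which temporal features could then be computed.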

Regarding LABurst’s classifier implementations, we used the
Scikit-learn 4 Python package for SVMs and RFs as well as an
implementation of the ensemble classifier AdaBoost, each of which
provided a number of hyperparameters to set. For SVMs, the primary
hyperparameter is the type of kernel to use, and initial experiments
showed SVMs with linear kernels performed poorly. We then applied
principal component analysis to reduce the training data’s
dimensionality to a three-dimensional space for visualization. The
resulting visualization showed a decision boundary more consistent
with a sphere than with a clear linear plane, motivating our choice
of the radial basis function (RBF) kernel.

For the remaining hyperparameters, we constructed separate parameter
grids for SVMs and RFs and performed a distributed grid search. The
grid for the SVM’s two parameters, cost $c$ and kernel coefficient
$\gamma$, covered powers of two such that $c, \gamma = 2^x$,
$x \in [-2, 10]$. RF parameters were similar for the number of
estimators $n$ and feature count $c$ such that $n = 2^x$,
$x \in [0, 10]$ and $c = 2^y$, $y \in [1, 12]$.
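
As an illustration of how such power-of-two grids might be enumerated, the sketch below builds both grids in plain Python. It assumes integer exponents over inclusive ranges. The dictionary keys follow scikit-learn's parameter names for `SVC` and `RandomForestClassifier`; mapping the "feature count" to `max_features` is our assumption rather than something the text states.

```python
from itertools import product

# SVM grid: cost and kernel coefficient, each 2**x for x in [-2, 10].
svm_exponents = range(-2, 11)
svm_grid = [
    {"C": 2.0 ** cost_exp, "gamma": 2.0 ** gamma_exp}
    for cost_exp, gamma_exp in product(svm_exponents, repeat=2)
]

# RF grid: estimators 2**x for x in [0, 10]; feature count 2**y for y in [1, 12].
rf_grid = [
    {"n_estimators": 2 ** x, "max_features": 2 ** y}
    for x, y in product(range(0, 11), range(1, 13))
]

print(len(svm_grid), "SVM settings;", len(rf_grid), "RF settings")
```

Each dictionary could then be scored independently, which is what makes a grid search straightforward to distribute across workers.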

Each parameter set was scored using the AUC metric across a
randomly split 10-fold cross-validation set, with the best scores
determining the parameters used in our ensemble. We then combined
the two classifiers using Scikit-learn’s AdaBoost implementation,
yielding the results shown in Table 5. These grid search results
show RFs perform better than SVMs, and the AdaBoost ensemble
outperforms each individual classifier.
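
For readers unfamiliar with the scoring procedure, the sketch below shows one standard way to compute ROC-AUC (the pairwise rank formulation) and to split samples randomly into ten folds. The function names are hypothetical and this is not the authors' code; in practice a library routine would likely be used instead.

```python
import random


def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly split sample indices into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # -> 0.75

# Fold sizes for a data set the size of the one described in Section 5.
folds = k_fold_indices(44146, k=10)
print([len(fold) for fold in folds])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positive samples above negative ones.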

Table 5: Per-Classifier Hyperparameter Scores

Classifier   Params                                     ROC-AUC
SVM          kernel = RBF, c = 64, $\gamma$ = 0.0625    87.48%
RF           trees = 1024, features = 2                 88.35%
AdaBoost     estimators = 2                             89.84%


4 http://scikit-learn.org/ 


6.2 Ablation Study 

Given the various features from both our own development and
related works, we should address the relative value or importance
of each feature to our task. To answer this question, we performed
an ablation study with a series of classifiers, each excluding a
single feature set. Each degenerate classifier was then compared with
the full AdaBoost classifier using the same 10-fold cross-validation
strategy as above. Table 6 shows each model’s AUC and its difference
from that of the full model. These results suggest the regression
and entropy features contribute the most, while the average
difference features seem to hinder performance.

Table 6: Ablation Study Results

Feature Sets                  ROC-AUC   Difference
AdaBoost, All Features        89.84%    -
Without Regression            87.79%    -2.05
Without Entropy               87.94%    -1.90
Without TF-IDF                88.85%    -0.99
Without TF-PDF                89.00%    -0.84
Without Density               89.07%    -0.77
Without Inter Arrival         89.46%    -0.38
Without BursT                 89.52%    -0.31
Without Average Difference    90.56%    +0.72
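
The leave-one-feature-set-out procedure behind Table 6 can be sketched as follows. Here `score_fn` is a hypothetical callable standing in for "train the classifier on these feature sets and return its cross-validated ROC-AUC", and the toy weights are invented purely so the example runs; they are not the paper's measurements.

```python
def ablation_study(feature_sets, score_fn):
    """Score the full model, then each model with one feature set removed."""
    full_score = score_fn(feature_sets)
    rows = [("All Features", full_score, 0.0)]
    for held_out in feature_sets:
        kept = [name for name in feature_sets if name != held_out]
        score = score_fn(kept)
        rows.append(("Without " + held_out, score, score - full_score))
    # List the most harmful removals (largest drop in score) first.
    return [rows[0]] + sorted(rows[1:], key=lambda row: row[2])


# Invented contributions for illustration only (not results from the paper).
toy_weights = {"Regression": 2.0, "Entropy": 1.5, "Average Difference": -0.5}


def toy_score(names):
    """Stand-in scorer: a base score plus each included set's contribution."""
    return 85.0 + sum(toy_weights[name] for name in names)


rows = ablation_study(list(toy_weights), toy_score)
for name, score, diff in rows:
    print(f"{name:30s} {score:6.2f} {diff:+.2f}")
```

A negative difference indicates a feature set whose removal hurts the model, while a positive difference, as with the average difference features in Table 6, suggests the feature set may be adding noise.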


6.3 Event Discovery Results 

To restate, the first research question (RQ1) posed in this work
is whether LABurst can perform as well as existing methods in
detecting key moments. For convenience, we focus on sporting
competitions, specifically training across several sporting events as
outlined in Tables 2 and 4, and testing on the final two games of
the 2013 MLB World Series, the 2014 NFL Super Bowl, and the
final two matches of the 2014 FIFA World Cup. Prior to presenting
comprehensive results, we first examine performance curves for each
sporting competition, as shown in Figure 2. Each graph in Figure 2
corresponds to a particular sport, with the blue and green lines
showing the ROC curves for RawBurst and TokenBurst respectively.
The red line shows the ROC curve for the LABurst model trained using
all features, whereas the black line illustrates the LABurst model
trained using all but the average difference feature set. We refer
to this restricted version as LABurst*.

For the 2013 World Series, RawBurst’s AUC is 0.62, TokenBurst’s
is 0.76, LABurst achieves 0.73, and LABurst* yields 0.76. From
Figure 2a, the two LABurst models clearly dominate RawBurst and
exhibit performance on par with TokenBurst. During the Super
Bowl, RawBurst and TokenBurst achieve AUCs of 0.68 and 0.78
respectively, while LABurst and LABurst* perform worse with AUCs
of 0.63 and 0.64, as shown in Figure 2b. During the 2014
World Cup, both LABurst and LABurst* (AUC = 0.72 and 0.73)
outperform both RawBurst (AUC = 0.66) and TokenBurst (AUC
= 0.64), as seen in Figure 2c.

6.4 Composite Results 

To compare comprehensive performance, we look to Figure 3,
which shows ROC curves for all methods across all three testing
events. From this figure, we see LABurst (AUC = 0.70) and
LABurst* (AUC = 0.71) both outperform RawBurst (AUC = 0.65) and
perform nearly as well as TokenBurst (AUC = 0.72). Given these
results, one can answer RQ1 affirmatively: LABurst is competitive
with existing methods.

Figure 2: Per-Sport ROC Curves. (a) 2013 World Series, (b) 2014
Super Bowl, (c) 2014 World Cup

More interestingly, assuming equal cost for false positives and
negatives and optimizing for the largest difference between true
positive rate (TPR) and false positive rate (FPR), TokenBurst shows
a TPR of 0.56 and FPR of 0.14 with a difference of 0.42 at a
threshold value of 13.2. LABurst, on the other hand, has a TPR of
0.64 and FPR of 0.28 with a difference of 0.36 at a threshold value
of 2. From these values, we see LABurst achieves a higher true
positive rate at the cost of a higher false positive rate. This
effect is possibly explained by the domain-specific nature of our
test set and TokenBurst implementation, as discussed in more detail
in Section 7.3.
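
Choosing the operating point that maximizes TPR minus FPR is commonly known as maximizing Youden's J statistic. The following is a minimal sketch of that selection, assuming each time slice has a binary ground-truth label and a numeric detector score; the function name and toy data are illustrative and not taken from the paper.

```python
def best_threshold(labels, scores):
    """Return (threshold, TPR, FPR) maximizing TPR - FPR (Youden's J)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        # A sample is flagged when its score meets or exceeds the threshold.
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        tpr, fpr = tp / n_pos, fp / n_neg
        if best is None or tpr - fpr > best[1] - best[2]:
            best = (threshold, tpr, fpr)
    return best


# Toy data: scores could be, for example, per-slice counts of bursty tokens.
labels = [0, 0, 0, 0, 1, 1]
scores = [0, 1, 1, 3, 2, 4]
print(best_threshold(labels, scores))  # -> (2, 1.0, 0.25)
```

Weighting false positives and false negatives equally is what makes the plain difference TPR - FPR the quantity to maximize; unequal costs would call for a weighted variant.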



Figure 3: Composite ROC Curves 


6.5 Earthquake Detection 

Our final research question (RQ2) seeks to determine whether
LABurst’s models, as trained on the sporting events listed in
Tables 2 and 4, can be adapted to compete with existing techniques
in a different domain. We explored this adaptation by applying the
sports-trained LABurst classifier to Twitter data surrounding known
earthquake events in Japan in 2013 and 2014.

Figures 4a and 4b show the detection curves for both methods for
the 2013 and 2014 earthquakes respectively; the red dots indicate
the earthquake times as reported by the United States Geological
Survey (USGS). The left vertical axis of each figure reports the
frequency of the “earthquake” token, and the right axis shows the
number of tokens classified as bursty by LABurst. From the
TokenBurst curve, one can see the token “earthquake” sees a
significant increase in usage when the earthquake occurs, and
LABurst experiences a similar increase simultaneously. It is worth
noting that LABurst exhibits bursts prior to the earthquake event,
but these peaks are unrelated to the earthquake since LABurst does
not differentiate between the earthquake and other high-impact
events that could be happening on Twitter. In addition, the peak
occurring about 50 minutes after the earthquake on 25 October 2013
potentially represents an aftershock event 5. Given the minimal
lag between LABurst and TokenBurst’s detection, we have shown
LABurst is effective in cross-domain event discovery (RQ2).
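
The lag comparison used for this task can be expressed in a few lines. The sketch below assumes each method produces a time-ordered series of (timestamp, value) pairs and that a detection is any value at or above a chosen threshold; the timestamps, values, and threshold shown are invented for illustration and are not the paper's data.

```python
from datetime import datetime, timedelta


def detection_lag(event_time, series, threshold):
    """Return the delay from event_time to the first detection at or after it.

    series is a list of (timestamp, value) pairs in time order. Returns
    None when no value reaches the threshold after the event.
    """
    for timestamp, value in series:
        if timestamp >= event_time and value >= threshold:
            return timestamp - event_time
    return None


# Hypothetical per-minute counts of bursty tokens around a hypothetical event.
start = datetime(2014, 1, 1, 12, 0)
values = [0, 1, 0, 0, 6, 9]
series = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]
event = start + timedelta(minutes=3)

print(detection_lag(event, series, threshold=5))  # -> 0:01:00
```

Running the same function on both methods' series and comparing the two returned delays gives the lag difference discussed above.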

One can now ask what tokens we identified as bursting when
the earthquakes occurred. Many of the tokens are in Japanese, and
tokens at the peak of the earthquake events are shown in Table 7.
We also extracted several tweets that contain the highest number of
these tokens for the given time period. Google Translate 6
translates a selection of these tweets as “Ah ah ah ah ah ah ah ah
ah Aa’s earthquake,” “I did not know earthquake because not using
cheat this time,” and “Over’s earthquake.”

Table 7: Discovered Bursty Tokens

Earthquake                         Bursty Tokens
Honshu, Japan - 25 October 2013    ? 3t, M, ffi, M, W, M TS, *
Iwaki, Japan - 11 July 2014        V, H», tf fc\ W, W, W, W, 51, J#


7. ANALYSIS 

In comparing LABurst with the baseline techniques, it is important
to note the strengths and weaknesses of each baseline: RawBurst
requires no prior information but provides little in the way of
semantic information regarding detected events, while TokenBurst
provides this semantic information at the cost of missing unknown
tokens or significant events that do not conform to its prior
knowledge. LABurst attempts to combine these two approaches by
supporting undirected event discovery while yielding insight into
these moments by tagging relevant bursting tokens.

7.1 Identifying Event-Related Tokens 

As mentioned, where the baselines sacrifice either insight or
flexibility, LABurst jointly attacks these problems and yields
event-related tokens automatically. These tokens may include
misspellings, colloquialisms, and language-crossing tokens, which
makes them hard to know a priori. The 2014 World Cup provides an
illustrative case for such unexpected tokens given its enormous
viewership: many Twitter users speaking many different languages are
likely tweeting about the same event. Table 8 shows a selection of
events from the final two World Cup matches and a subset of those
tokens classified as bursting during the events (one should note the
list is not exhaustive owing to formatting and space constraints).

5 http://ds.iris.edu/spud/aftershock/9761021

6 http://translate.google.com

Figure 4: Japanese Earthquake Detection. (a) Honshu, Japan
Earthquake - 25 October 2013, (b) Iwaki, Japan Earthquake - 11 July
2014

Table 8: Tokens Classified as Bursting During Events

Match                    Event                            Bursty Tokens
Brazil v. Netherlands,   Netherlands’ Van Persie scores   0-1, 1-0, 1:0, 1x0, card, goaaaaaaal,
12 July 2014             a goal on a penalty at 3’, 1-0   goal, gol, goool, holandaaaa, kirmizi,
                                                          pen, penal, penalti, penalti, persie, red

Brazil v. Netherlands,   Brazil’s Oscar gets a yellow     dive, juiz, penalty, ref
12 July 2014             card at 68’

Germany v. Argentina,    Germany’s Götze scores a goal    goaaaaallllllll, goalllll, godammit,
13 July 2014             at 113’, 1-0                     goetze, gollllll, gooooool, gotze,
                                                          gotzeeee, gotze, nooo, yessss, K T y


Several interesting artifacts emerge from this table, the first of
which is that one can get an immediate sense of what happened in the
detected moment from the tokens our algorithm presents. For instance,
the prevalence of the token “goal” and its variations clearly
indicates a team scored in the first and third events in Table 8;
similarly, bursting tokens associated with the middle event regarding
Oscar’s yellow card reflect his penalty for diving. Beyond the pseudo
event description put forth by the identified tokens, references to
diving and to specific player/team names are also of significant
interest. In the first event, one can infer that the Netherlands
scored since “holandaaaa” is flagged along with “persie” for the
Netherlands’ player, Van Persie, and likewise for Germany’s Götze
in the third event (and the accompanying variations of his name).
These tokens would be difficult to capture beforehand as TokenBurst
would require, and such tokens would likely not be related to every
event or every type of sporting event.

Finally, the last artifact of note is that the set of bursty tokens
displayed includes tokens from several different languages: English
for “goal” and “penalty,” Spanish for “gol” and “penal,” Brazilian
Portuguese for “juiz” (meaning “referee”), as well as the Arabic
for “goal” and Japanese for “Germany.” Since these words are
semantically similar but syntactically distinct, typical
normalization schemes could not capture these connections. Instead,
capturing these words in the baseline would require a pre-specified
keyword list in all possible languages or a machine translation
system capable of normalizing within different languages (to collapse
“goool” down to “gol” for example).

7.2 Discovering Unanticipated Moments 

Results show LABurst is competitive with the domain-specific
TokenBurst, but TokenBurst’s specificity makes it unable to detect
unanticipated moments, and we can see instances of such omissions
in the last game of the World Cup. Figure 5 shows target token
frequencies for TokenBurst in green and LABurst’s volume of bursty
tokens in red. From this graph, we can see the first instance at
peak #1, where LABurst exhibits a peak missed by TokenBurst. Tokens
appearing in this peak include “puyol,” “gisele,” and “bundchen,”
which correspond to former Spanish player Carles Puyol and model
Gisele Bündchen, who presented the World Cup trophy prior to the
match. While not necessarily a sports-related event, many viewers
were interested in the trophy reveal, making it a key moment. At
peak #2, slightly more than eighty minutes into the data (which is
sixty minutes into the match), LABurst sees another peak that is
otherwise inconspicuous in the TokenBurst curve. Upon further
exploration, tokens present in this peak refer to Argentina’s
substituting Agüero for Lavezzi at the beginning of the match’s
second half.

7.3 Addressing the Super Bowl 

While LABurst performs as well as the domain-specific TokenBurst
algorithm in both the World Series and World Cup events, one cannot
ignore its poor performance during the Super Bowl. Since LABurst is
both language agnostic and domain independent, it likely detects
additional high-impact events outside of the game start/end, score,
and penalty events present in our ground truth. For instance, during
the Super Bowl, spectators tweet about moments beyond sports plays:
they tweet about the half-time show, commercials, and massive power
outages. Since our ground truth disregards such moments, LABurst’s
higher false-positive rate is less surprising, and TokenBurst’s
superior performance might result from its specificity in domain
knowledge with respect to the ground truth (i.e., both include only
sports data). Hence, LABurst’s ability to detect unanticipated
moments potentially penalizes it in domain-specific tasks.

LABurst’s propensity towards more organic moments of interest
becomes obvious when we inspect the tokens LABurst identified
when it detected a large burst early on that TokenBurst missed.

Figure 5: Baseline and LA Bursty Frequencies


Approximately four minutes before the game started (and therefore
before TokenBurst would detect any event), LABurst saw a large
burst with tokens like “joe”, “namath”, “fur”, “coat”, “pimp”,
“jacket”, “coin”, and “toss”. As it turns out, Joe Namath, a former
football star, garnered significant attention from fans when he
tossed the coin to decide which team would get first possession.
Since neither our ground truth data nor TokenBurst’s domain
knowledge captured this moment, LABurst’s detection is counted
as a false positive, much like the trophy presentation during the
World Cup.

8. LIMITATIONS AND EXTENSIONS 

The approach adopted herein is fundamentally limited in tracking
potentially interesting events that do not garner mass awareness on
social media. Since LABurst presupposes significant bursts in
activity during key moments, if only a few people are participating
in or following an event, LABurst will likely be unable to detect
moments in that event. This effect is clear in applying LABurst to
regular season baseball games: since Major League Baseball sees over
2,400 games in a season, experiments showed too few viewers were
posting messages to Twitter during these games to generate any
significant burst. As a result, many key moments in these games are
exceedingly difficult to capture via burst detection.

This deficiency leads to a potential opportunity, however, in
combining domain knowledge with LABurst’s domain-agnostic
foundations. For example, one could apply domain-specific filters to
the Twitter stream prior to LABurst in the detection pipeline. Since
LABurst uses relative frequencies to identify bursts, this
pre-filtering step should amplify the signal of potentially bursty
tokens in the stream and increase LABurst’s likelihood of detecting
them. Returning to the baseball example, one could use domain
information to filter the Twitter stream to contain only relevant
tweets, and the baseball-specific key moments should become more
apparent.
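
To illustrate why pre-filtering might amplify a token's signal, the sketch below compares a token's relative frequency before and after a simple keyword filter. The tweets, the keyword list, and both function names are hypothetical, and whitespace tokenization is a simplification.

```python
def relative_frequency(tweets, token):
    """Fraction of all tokens in the stream that equal the given token."""
    tokens = [t for tweet in tweets for t in tweet.lower().split()]
    return tokens.count(token) / len(tokens) if tokens else 0.0


def domain_filter(tweets, keywords):
    """Keep only tweets mentioning at least one (hypothetical) domain keyword."""
    keywords = {k.lower() for k in keywords}
    return [tweet for tweet in tweets if keywords & set(tweet.lower().split())]


# Invented stream: two baseball tweets mixed with unrelated chatter.
tweets = [
    "homer for the orioles",
    "lunch was great today",
    "orioles homer again",
    "traffic is terrible this morning",
]
filtered = domain_filter(tweets, ["orioles"])

print(relative_frequency(tweets, "homer"))    # -> 0.125
print(relative_frequency(filtered, "homer"))  # higher after filtering
```

Because the unrelated tweets no longer dilute the token counts, the same number of occurrences makes up a larger share of the filtered stream.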

In a more interesting case, this domain knowledge could be applied
as events are discovered and allow LABurst to provide more insight
into those events as they unfold. Examples where such an approach
could be used include hurricanes, where one can know the name of the
hurricane and its approximate area of landfall, filter the Twitter
stream accordingly, and then use LABurst to track the unanticipated
moments that occur once the storm hits. One could apply a similar
approach in the early hours of political protests or mass unrest to
track events that may not be covered by mainstream news outlets
(e.g., in oppressive regimes where media is controlled). Additional
knowledge such as geolocation data could also be integrated into
these stream filters to increase LABurst’s moment discovery
capabilities further.

9. CONCLUSIONS 

Revisiting motivations, this research sought to demonstrate whether 
LABurst, a streaming, language-agnostic, burst-centric algorithm, 
can discover key moments from unfiltered social media streams 
(specifically Twitter’s public sample stream). Our results show 
temporal features can identify bursty tokens and, using the volume 
of these tokens as an indicator, we can discover key moments across 
a collection of disparate sporting competitions. This approach’s 
performance is competitive with existing baselines. Furthermore, 
these sports-trained models are adaptable to other domains with 
a level of performance exceeding a simple time series baseline and 
rivaling a domain-specific method. LABurst’s performance relative 
to the domain-specific baseline shows this method’s potential given 
its omission of manual keyword selection and prior knowledge. 

Beyond this comparison, our approach also offers notable flexibility
in identifying bursting tokens across language boundaries and in
supporting event description; that is, we can get a sense of the
occurring event by inspecting bursty tokens returned by LABurst.
These features combine to form a capable tool for discovering
unanticipated moments of high interest, regardless of language. This
technique is particularly useful for journalists and first
responders, who have a vested interest in rapidly identifying and
understanding high-impact moments, even if a journalist or aid worker
is not physically present to observe the event. Possibilities also
exist to combine LABurst with other domain-specific solutions and
yield insight into unanticipated events, events missed by existing
approaches, or events that might otherwise be lost in the noise.